1.3 Bayes decision theory. The distinguishing feature of Bayesian statistics is that a
probability distribution $\pi$, called a prior, is given on the parameter space $(\Theta,\mathcal{T})$.
Sometimes, priors are also considered which may be infinite, such as Lebesgue measure on the
whole real line, but such priors will not be treated here, at least for the time being.

A Bayesian statistician chooses a prior $\pi$ based on whatever information on the unknown
$\theta$ is available in advance of making any observations in the current experiment. In
general, no definite rules are prescribed for choosing $\pi$. Priors are often useful as technical
tools in reaching non-Bayesian conclusions such as admissibility in Theorems 1.2.5 and
1.2.6.

Bayes decision rules were defined near the end of the last section as rules which 
minimize the Bayes risk and for which the risk is finite. Bayes tests of P vs. Q, treated 
in Theorem 1.1.8, are a special case of Bayes decision rules. We saw in that case that 
Bayes rules need not be randomized (Remark 1.1.9). The same is true quite generally in 
Bayes decision theory: if, in a given situation, it is Bayes to choose at random among two 
or more possible decisions, then the decisions must have equal risks (conditional on the 
observations) and we may as well just take one of them. Theorem 1.3.1 will give a more 
precise statement. 

In game theory, randomization is needed to have a strategy that is optimal even if 
the opponent knows it and can choose a strategy accordingly. If one knows the opponent's 
strategy then it is not necessary to randomize. Sometimes, statistical decision theory is 
viewed as a game against an opponent called "Nature." Unlike an opponent in game 
theory, "Nature" is viewed as neutral, not trying to win the game. Assuming a prior, as 
in Bayes decision theory, is to assume in effect that "Nature" follows a certain strategy. 

In showing that randomization isn't needed, it will be helpful to formulate randomiza- 
tion in a fuller way, where we not only choose a probability distribution over the possible 
actions, but then also choose an action according to that distribution, in a measurable 
way, as follows: 

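Definition. A randomized decision rule $d$ on $X$, with each $d(x)$ a probability law on the
action space $A$, is realizable if there is a probability space $(\Omega,\mathcal{F},\mu)$ and a
jointly measurable function $\delta : X \times \Omega \to A$ such that for each $x$ in $X$,
$\delta(x,\cdot)$ has distribution $d(x)$, in other words $d(x)$ is the image measure of $\mu$ by
$\delta(x,\cdot)$, $d(x) = \mu \circ \delta(x,\cdot)^{-1}$.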

For example, a randomized test as in Sec. 1.1 is always a realizable rule, where we can
take $\Omega$ as the interval $[0,1]$ with Lebesgue measure and let $\delta(x,t) = d_Q$ if $t < f(x)$
and $d_P$ otherwise.
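The construction of $\delta(x,t)$ for a randomized test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the particular test function `f` below is hypothetical (it is not taken from Sec. 1.1), the decisions $d_Q$ and $d_P$ are represented by the strings `"Q"` and `"P"`, and a pseudo-random uniform number stands in for a point $t$ of $[0,1]$ with Lebesgue measure.

```python
import random

def f(x):
    # Hypothetical randomized test: probability of deciding Q after observing x.
    if x > 1.0:
        return 1.0
    if x == 1.0:
        return 0.3
    return 0.0

def delta(x, t, f):
    """Non-randomized choice delta(x, t): decide Q if t < f(x), otherwise P."""
    return "Q" if t < f(x) else "P"

def empirical_prob_Q(x, n=100_000, seed=0):
    """Fraction of uniform draws t for which delta(x, t) decides Q."""
    rng = random.Random(seed)
    return sum(delta(x, rng.random(), f) == "Q" for _ in range(n)) / n
```

For fixed `x`, the set of `t` with `delta(x, t, f) == "Q"` is the interval $[0, f(x))$, of Lebesgue measure $f(x)$, so `empirical_prob_Q(x)` should be close to `f(x)` for large `n`.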

It is shown in the next section that decision rules are realizable under conditions wide
enough to cover a great many cases, for example whenever the action space is a subset of
a space $\mathbb{R}^k$ with Borel $\sigma$-algebra. It will be shown next that randomization is unnecessary
for realizable Bayes rules. The idea is that the Bayes risk of a realizable randomized Bayes
rule $d(\cdot)$ is an average of Bayes risks of non-randomized rules $\delta(\cdot,\omega)$. Since a Bayes rule
has minimum Bayes risk, the risks of $\delta(\cdot,\omega)$ are no smaller, so they must almost all be
equal to that of $d(\cdot)$. Then such non-randomized rules $\delta(\cdot,\omega)$ for fixed $\omega$ are Bayes rules.

1.3.1 Theorem. For any decision problem for a measurable family $\{P_\theta,\ \theta \in \Theta\}$ and prior
$\pi$, if there is a realizable Bayes randomized decision rule $d$, then there is a non-randomized
Bayes decision rule.

Proof. First, here is a helpful technical fact: 

1.3.2 Lemma. For any measurable family $\{P_\theta,\ \theta \in \Theta\}$ and nonnegative, jointly
measurable function $f : (\theta,x,\omega) \mapsto f(\theta,x,\omega)$, the function $g$ defined by
$g(\theta,\omega) := \int f(\theta,x,\omega)\,dP_\theta(x)$ is jointly measurable.

Proof. If $f(\theta,x,\omega) = 1_T(\theta)1_B(x)1_F(\omega)$ for some $T \in \mathcal{T}$, $B \in \mathcal{B}$ and
$F \in \mathcal{F}$, then $g(\theta,\omega) = P_\theta(B)1_T(\theta)1_F(\omega)$ is measurable in $(\theta,\omega)$ since
$\theta \mapsto P_\theta(B)$ is measurable by assumption. The rest of the proof of the Lemma is like that
of Prop. 1.2.4. □

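Now to prove Theorem 1.3.1, take $(\Omega,\mathcal{F},\mu)$ and $\delta(\cdot,\cdot)$ as in the definition that $d$ is
realizable. For each fixed $\omega \in \Omega$, $\delta(\cdot,\omega)$ is a non-randomized decision rule. So
$r(\pi,\delta(\cdot,\omega)) \ge r(\pi,d)$ since $d$ is Bayes for $\pi$. Also, writing $\nu(da) := d\nu(a)$ for a
measure $\nu$,

$$r(\pi,d) = \int r(\theta,d)\,d\pi(\theta) = \int\!\!\int r(\theta,d(x))\,dP_\theta(x)\,d\pi(\theta) \quad \text{(by the definitions)}$$

$$= \int\!\!\int\!\!\int L(\theta,a)\,d(x)(da)\,dP_\theta(x)\,d\pi(\theta) = \int\!\!\int\!\!\int L(\theta,\delta(x,\omega))\,d\mu(\omega)\,dP_\theta(x)\,d\pi(\theta)$$

by the image measure theorem, e.g. RAP, 4.1.11. So by the Tonelli-Fubini theorem for
nonnegative measurable functions, twice, and the measurability shown in Lemma 1.3.2, we
get

$$r(\pi,d) = \int\!\!\int\!\!\int L(\theta,\delta(x,\omega))\,dP_\theta(x)\,d\pi(\theta)\,d\mu(\omega) = \int r(\pi,\delta(\cdot,\omega))\,d\mu(\omega).$$

Thus, $r(\pi,\delta(\cdot,\omega)) = r(\pi,d)$ for $\mu$-almost all $\omega$, and so for some $\omega$, providing a Bayes
non-randomized decision rule $\delta(\cdot,\omega)$. □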
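The averaging identity in this proof can be checked by hand on a small finite problem. The following Python sketch assumes a hypothetical setup that is not part of the text: two parameter values, two observations, two actions, 0-1 loss, and a prior chosen so that the two actions tie after observing $x = 1$. Exact rational arithmetic is used so that the equalities hold exactly.

```python
from fractions import Fraction as F

# Hypothetical finite problem: Theta = X = A = {0, 1}, with 0-1 loss.
prior = {0: F(3, 4), 1: F(1, 4)}
P = {0: {0: F(4, 5), 1: F(1, 5)},   # P[theta][x] = P_theta({x})
     1: {0: F(2, 5), 1: F(3, 5)}}
d = {0: F(0), 1: F(1, 2)}           # d[x] = probability of choosing action 1

def loss(theta, a):
    return F(0) if theta == a else F(1)

def risk_randomized(d):
    """Bayes risk r(pi, d) of the randomized rule d."""
    total = F(0)
    for theta, p_theta in prior.items():
        for x, p_x in P[theta].items():
            expected_loss = d[x] * loss(theta, 1) + (1 - d[x]) * loss(theta, 0)
            total += p_theta * p_x * expected_loss
    return total

def delta(x, omega):
    """Realization of d on Omega = [0, 1): choose action 1 iff omega < d[x]."""
    return 1 if omega < d[x] else 0

def risk_nonrandomized(omega):
    """Bayes risk r(pi, delta(., omega)) of the non-randomized rule for fixed omega."""
    return sum(prior[t] * P[t][x] * loss(t, delta(x, omega))
               for t in prior for x in P[t])

# delta(., omega) is constant in omega on [0, 1/2) and on [1/2, 1),
# so the integral over omega is an average of two values.
average = F(1, 2) * risk_nonrandomized(F(1, 4)) + F(1, 2) * risk_nonrandomized(F(3, 4))
```

In this example `risk_randomized(d)`, `average`, and both values of `risk_nonrandomized` are equal, as the proof predicts when $d$ is Bayes: each of the non-randomized rules $\delta(\cdot,\omega)$ is itself Bayes.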

If every randomized rule is realizable, as is shown in the next section under conditions 
given there, then Theorem 1.3.1 shows that the non-randomized rules form an essentially 
complete class, as defined in Sec. 1.2. It will also be shown in Sec. 2.2 below that non- 
randomized rules are (essentially) complete under some other conditions. 

Definition. A family $\{P_\theta,\ \theta \in \Theta\}$ of laws on a measurable space $(X,\mathcal{B})$ will be called
dominated if for some $\sigma$-finite measure $v$, each law $P_\theta$ is absolutely continuous with respect
to $v$, in other words for any $A \in \mathcal{B}$, $v(A) = 0$ implies $P_\theta(A) = 0$ for all $\theta$.

Often, $v$ would be Lebesgue measure on $\mathbb{R}^k$; or, if the measures were all concentrated
on a countable set such as the integers, $v$ would be counting measure (the measure giving
mass 1 to each point) on the set.

If $P_\theta$ is absolutely continuous with respect to $v$, then by the Radon-Nikodym theorem
(RAP, 5.5.4), it has a density or Radon-Nikodym derivative $f(\theta,x) := (dP_\theta/dv)(x)$. A
$\sigma$-algebra $\mathcal{B}$ is called countably generated if there is a countable subcollection
$\mathcal{C} \subset \mathcal{B}$ such that $\mathcal{B}$ is the smallest $\sigma$-algebra including $\mathcal{C}$. In any separable metric
space, the Borel $\sigma$-algebra is countably generated (taking $\mathcal{C}$ as the set of balls with rational
radii and centers in a countable dense set). In the great majority of applications of statistics,
sample spaces are separable metric spaces, in fact Euclidean spaces $\mathbb{R}^k$. At any rate, from here
on it will be assumed that $\mathcal{B}$ is countably generated, unless something to the contrary is stated.

1.3.3 Theorem. If $\{P_\theta,\ \theta \in \Theta\}$ is a dominated, measurable family on a sample space
$(X,\mathcal{B})$, for a parameter space $(\Theta,\mathcal{T})$ and a $\sigma$-finite measure $v$, then the density function
$f(\theta,x) = (dP_\theta/dv)(x)$ can be taken to be jointly measurable in $\theta$ and $x$.

Proof. Let $\mathcal{B}_r$, $r = 1, 2, \ldots$, be an increasing sequence of finite Boolean algebras of
subsets of $X$ whose union generates $\mathcal{B}$. (Such algebras exist by the blanket assumption
that $\mathcal{B}$ is countably generated.) There is a probability measure $Q$ equivalent to (mutually
absolutely continuous with) $v$: to see this, let $X$ be a union of disjoint measurable sets $A_j$
with $0 < v(A_j) < \infty$, and for any $B \in \mathcal{B}$ let $Q(B) = \sum_{j=1}^{\infty} v(B \cap A_j)/(2^j v(A_j))$. So we
can assume that $v$ is a probability measure.

For each $\theta$, let $g(\theta,\cdot) := dP_\theta/dv$. Let $g_r(\theta,\cdot)$ be the conditional expectation of
$g(\theta,\cdot)$ given $\mathcal{B}_r$ for $v$, $g_r(\theta,\cdot) := E(g(\theta,\cdot) \mid \mathcal{B}_r)$. This can be defined in either of
two ways. One is that $P_\theta$ remains absolutely continuous with respect to $v$ if both are
restricted to $\mathcal{B}_r$, and $g_r(\theta,\cdot) = dP_\theta/dv$ (Radon-Nikodym derivative) for these restrictions
to $\mathcal{B}_r$. The other is that $\mathcal{B}_r$ is generated by a finite collection of atoms, which are non-empty
sets $A \in \mathcal{B}_r$ of which no proper, non-empty subset belongs to $\mathcal{B}_r$. Then for $x$ in such an atom
$A$, $g_r(\theta,x) = P_\theta(A)/v(A)$, or if $v(A) = 0$ then let $g_r(\theta,x) = 0$. Let $B_{ri}$ for $i = 1, \ldots, I(r)$
be the atoms of $\mathcal{B}_r$. Then since $\{P_\theta,\ \theta \in \Theta\}$ is measurable, for each fixed $x$, $g_r(\cdot,x)$ is
measurable. There are only finitely many possibilities for this function, each for $x$ in a
measurable set $B_{ri}$, so $g_r$ is jointly measurable in $\theta$ and $x$.

Here a fact from probability theory will be used: for each fixed $\theta$, the sequence $g_r(\theta,\cdot)$
of functions on $X$ is a right-closed martingale (RAP, p. 283), with $g_\infty := g$, and
$g_r(\theta,x) \to g(\theta,x)$ as $r \to \infty$ for $P_\theta$-almost all $x$.

The set on which a sequence of measurable real-valued functions converges is measurable
(RAP, proof of Theorem 4.2.5). Let $f(\theta,x) := \lim_{r\to\infty} g_r(\theta,x)$ whenever the
limit exists and $f(\theta,x) = 0$ otherwise. Then $f$ is jointly measurable and for each $\theta$,
$f(\theta,x) = g(\theta,x)$ almost surely for $P_\theta$, so $f(\theta,\cdot)$ is a density of $P_\theta$ with respect to $v$. □
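The atom formula $g_r(\theta,x) = P_\theta(A)/v(A)$ used in this proof can be illustrated numerically for a single fixed $\theta$. The sketch below makes several assumptions that are not in the text: $X = [0,1)$, $v$ is Lebesgue measure, the law has the hypothetical density $g(x) = 2x$, and the algebra at level $r$ is generated by the dyadic intervals $[i/2^r, (i+1)/2^r)$.

```python
import math

def cdf(x):
    """Distribution function of a hypothetical law P on [0, 1) with density g(x) = 2x."""
    return x * x

def g_r(x, r):
    """P(A)/v(A) on the dyadic atom A = [i/2^r, (i+1)/2^r) containing x."""
    n = 2 ** r
    i = min(int(math.floor(x * n)), n - 1)   # index of the atom containing x
    a, b = i / n, (i + 1) / n
    return (cdf(b) - cdf(a)) / (b - a)       # v(A) = b - a for Lebesgue measure
```

Here `g_r(x, r)` equals $a + b$ for the endpoints of the atom containing $x$, so it differs from the density $2x$ by at most $2^{-r}$; for example `g_r(0.3, 1)` is about 0.5, while `g_r(0.3, 20)` is within $10^{-5}$ of 0.6.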

Under the hypotheses of Theorem 1.3.3, it will be assumed from here on that $f(\theta,x)$
is jointly measurable in $\theta$ and $x$.

If $\{P_\theta,\ \theta \in \Theta\}$ is a dominated, measurable family of laws on $X$, with jointly measurable
densities $q(\theta,x)$ with respect to some measure $v$, then the family of laws for $n$ i.i.d.
observations, $\{P_\theta^n : \theta \in \Theta\}$ on $X^n$, is clearly dominated and measurable, with jointly
measurable densities $f(\theta,x) = \prod_{j=1}^{n} q(\theta,x_j)$. For a dominated, measurable family and a
fixed $x$, $f(\cdot,x)$ is a function on $\Theta$ called the likelihood function. The posterior distribution
on $\Theta$ given $x$ is the law $\pi_x$ having density with respect to $\pi$ given by
$f(\cdot,x)/\int f(\theta,x)\,d\pi(\theta)$, provided that the integral in the denominator is strictly positive
and finite. In other words, for any measurable set $C$ of parameters,

(1.3.4)    $$\pi_x(C) = \int_C f(\theta,x)\,d\pi(\theta) \Big/ \int_\Theta f(\theta,x)\,d\pi(\theta).$$

Here the denominator is a measurable function of $x$ by joint measurability. By the Tonelli-
Fubini theorem, $\int\!\!\int f(\theta,x)\,d\pi(\theta)\,dv(x) = 1$, so $\int f(\theta,x)\,d\pi(\theta) < \infty$ for $v$-almost all $x$.
If $x$ is such that $\int f(\theta,x)\,d\pi(\theta) = 0$, then the posterior given $x$ is not defined. Observing such
an $x$ indicates that the prior and/or likelihood function are incorrectly specified. If before
taking the observation the (Bayesian) statistician believed that $\theta$ had the prior distribution
$\pi$, then after observing $x$ the distribution of $\theta$ becomes $\pi_x$.
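When the prior is concentrated on finitely many parameter values, the integrals in (1.3.4) become finite sums, and the posterior can be computed directly. The following Python sketch is an illustration under assumptions not made in the text: the observations are i.i.d. Bernoulli($\theta$) trials, the prior is a hypothetical two-point prior, and the function names `likelihood` and `posterior` are chosen here for convenience.

```python
from fractions import Fraction as F

def likelihood(theta, x):
    """f(theta, x) for i.i.d. Bernoulli(theta) observations x, a tuple of 0s and 1s."""
    k = sum(x)
    return theta ** k * (1 - theta) ** (len(x) - k)

def posterior(prior, x):
    """Posterior pi_x as a dict theta -> mass, or None if the denominator in (1.3.4) is 0."""
    denom = sum(p * likelihood(theta, x) for theta, p in prior.items())
    if denom == 0:
        return None   # the posterior given x is not defined
    return {theta: p * likelihood(theta, x) / denom for theta, p in prior.items()}

# Hypothetical prior giving mass 1/2 to each of theta = 1/4 and theta = 3/4.
prior = {F(1, 4): F(1, 2), F(3, 4): F(1, 2)}
post = posterior(prior, (1, 1, 0))
```

With two successes and one failure observed, the likelihoods are $3/64$ and $9/64$, so `post` gives mass $1/4$ to $\theta = 1/4$ and $3/4$ to $\theta = 3/4$. A prior concentrated at $\theta = 0$ together with an observed success gives a zero denominator, the case in which the prior or likelihood is incorrectly specified.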

Next, $\pi$ and $\{P_\theta,\ \theta \in \Theta\}$ give a joint distribution for $\theta$ and $x$:

1.3.5 Proposition. For any measurable family $\{P_\theta,\ \theta \in \Theta\}$ and prior $\pi$ on $(\Theta,\mathcal{T})$, there
is a probability distribution $\Pr$ on $(\Theta \times X, \mathcal{T} \otimes \mathcal{B})$ for which the marginal distribution on
$\Theta$ is $\pi$ and for each $\theta$, the conditional distribution of $x$ is $P_\theta$.

Proof. For any $A \in \mathcal{T} \otimes \mathcal{B}$, let $\Pr(A) := \int\!\!\int 1_A(\theta,x)\,dP_\theta(x)\,d\pi(\theta)$ if the integrals are
defined. The collection of all sets $A$ for which the integrals are defined contains all sets
$C \times B$ for $C \in \mathcal{T}$ and $B \in \mathcal{B}$. Thus it contains all measurable sets, as in the construction of
product measures (RAP, Sec. 4.4). So $\Pr$ is well-defined and by monotone convergence is
a countably additive probability measure on $\mathcal{T} \otimes \mathcal{B}$. Clearly, $\pi$ is the marginal distribution
of $\theta$ for $\Pr$ and $P_\theta$ is a conditional distribution of $x$ given $\theta$. □

The marginal distribution of $x$ for $\Pr$, namely the law $\gamma$ on $X$ having density
$\int f(\theta,x)\,d\pi(\theta)$ with respect to $v$, is called the predictive distribution of $x$. For any
$B \in \mathcal{B}$ we have

(1.3.6)    $$\gamma(B) = \int P_\theta(B)\,d\pi(\theta).$$

Next is an existence fact for posteriors: 

1.3.7 Theorem. For any dominated, measurable family $\{P_\theta,\ \theta \in \Theta\}$ and prior $\pi$, we
have $0 < \int f(\theta,x)\,d\pi(\theta) < \infty$ for $\gamma$-almost all $x$, and the posterior $\pi_x$ is well-defined.

Proof. As noted above, $\int f(\theta,x)\,d\pi(\theta) < \infty$ for $v$-almost all $x$. If $B \in \mathcal{B}$ and $v(B) = 0$,
then $P_\theta(B) = 0$ for $\pi$-almost all $\theta$, so $\gamma(B) = 0$ by (1.3.6). So "$v$-almost" implies
"$\gamma$-almost" all $x$.

Let $D := \{x : \int f(\theta,x)\,d\pi(\theta) = 0\}$. Then by (1.3.6),

$$\gamma(D) = \int\!\!\int_D f(\theta,x)\,dv(x)\,d\pi(\theta) = \int_D\!\!\int f(\theta,x)\,d\pi(\theta)\,dv(x) = 0.$$

So the given inequalities are proved. To finish the proof it will be shown that the posterior
distribution doesn't depend on the choice of the $\sigma$-finite dominating measure $v$. More
precisely, if $v$ and $w$ are two such measures and $\pi_x^{v}$, $\pi_x^{w}$ the corresponding posteriors, it
will be shown that $\pi_x^{v} = \pi_x^{w}$ for $\gamma$-almost all $x$.
Here $v + w$ will be another dominating measure, and $v + w$ is $\sigma$-finite since we can
take $A_i \cap B_j$ with $v(A_i) < \infty$ and $w(B_j) < \infty$. So we can replace $w$ by $v + w$. Applying
the Radon-Nikodym theorem on sets where $v$ and $w$ are both finite, we get a measurable
function $dv/dw := g \ge 0$ such that $v(A) = \int_A g\,dw$ for all $A \in \mathcal{B}$. We also have

$$\frac{dP_\theta}{dw} = \frac{dP_\theta}{dv}\,\frac{dv}{dw}$$

almost everywhere for $w$. Thus in the definition of posterior, for a given $x$, both numerator
and denominator are multiplied by $g(x)$, so the posterior is unchanged if $g(x) > 0$. The
set $C$ where $g = 0$ has $v(C) = 0$ and so $\gamma(C) = 0$ as desired, finishing the proof. □

The conditional risk of an action $c \in A$, given $x$, is defined as

$$r_x(\pi,c) := \int L(\theta,c)\,d\pi_x(\theta)$$

if $\pi_x$ exists, as we just saw it does for $\gamma$-almost all $x$. The next fact shows that decision
rules are Bayes if they minimize the conditional risk for almost all observations.

1.3.8 Theorem. If for a given measurable family $\{P_\theta,\ \theta \in \Theta\}$, prior $\pi$ and loss function
$L$, $a(\cdot)$ is a decision rule such that for $\gamma$-almost all $x$,

(1.3.9)    $$r_x(\pi,a(x)) = \inf\{r_x(\pi,c) : c \in A\},$$

and if there exists a rule $e(\cdot)$ with finite risk, then $a(\cdot)$ is a Bayes rule, and any Bayes rule
$b(\cdot)$ in place of $a(\cdot)$ also satisfies (1.3.9).

Proof. Applying Theorem 1.3.7, let $\pi_x$ exist and (1.3.9) hold for $x \notin B$ where $\gamma(B) = 0$.
Then by (1.3.6), $P_\theta(B) = 0$ for $\pi$-almost all $\theta$. From the definitions and the Tonelli-Fubini
theorem, for any decision rule $b(\cdot)$,

(1.3.10)    $$r(\pi,b) = \int\!\!\int_{X\setminus B} L(\theta,b(x))f(\theta,x)\,dv(x)\,d\pi(\theta) = \int_{X\setminus B}\!\!\int L(\theta,b(x))f(\theta,x)\,d\pi(\theta)\,dv(x).$$

Given $x$, minimizing $\int L(\theta,c)f(\theta,x)\,d\pi(\theta)$ with respect to $c$ is equivalent to minimizing

$$r_x(\pi,c) = \int L(\theta,c)f(\theta,x)\,d\pi(\theta) \Big/ \int f(\psi,x)\,d\pi(\psi),$$

since $\int f(\psi,x)\,d\pi(\psi)$ is strictly positive and finite and doesn't depend on $c$. So $a(x)$ achieves
this minimum for $x \notin B$. Thus for any decision rule $b(\cdot)$,

(1.3.11)    $$\int L(\theta,a(x))f(\theta,x)\,d\pi(\theta) \le \int L(\theta,b(x))f(\theta,x)\,d\pi(\theta)$$
for $x \notin B$. Taking $\int_{X\setminus B}$ of both sides and applying (1.3.10) gives $r(\pi,a(\cdot)) \le r(\pi,b(\cdot))$.
Taking $b(\cdot) = e(\cdot)$ we see that the minimum risk, which $a(\cdot)$ achieves, is finite. In other
words, $a(\cdot)$ is Bayes. If $b(\cdot)$ is also Bayes, then $r(\pi,b(\cdot)) = r(\pi,a(\cdot))$. Thus the inequality
in (1.3.11) must be an equality for $v$-almost and so $\gamma$-almost all $x$, and (1.3.9) must hold
for $b(\cdot)$ in place of $a(\cdot)$ and for $\gamma$-almost all $x$. □
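The recipe in this proof, namely to minimize $\int L(\theta,c)f(\theta,x)\,d\pi(\theta)$ over actions $c$ separately for each $x$, is easy to carry out when everything is finite. The Python sketch below assumes a hypothetical problem with two parameter values, two observations, two actions, counting measure as the dominating measure and 0-1 loss; none of these choices come from the text. It also compares the resulting rule with a brute-force search over all non-randomized rules.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical finite problem: Theta = X = A = {0, 1}, with 0-1 loss.
thetas, xs, actions = (0, 1), (0, 1), (0, 1)
prior = {0: F(1, 2), 1: F(1, 2)}
f = {0: {0: F(4, 5), 1: F(1, 5)},   # f[theta][x]: density w.r.t. counting measure
     1: {0: F(1, 5), 1: F(4, 5)}}

def loss(theta, c):
    return F(0) if theta == c else F(1)

def bayes_action(x):
    """Action minimizing the unnormalized conditional risk sum_theta L f pi."""
    return min(actions,
               key=lambda c: sum(loss(t, c) * f[t][x] * prior[t] for t in thetas))

def bayes_risk(rule):
    """Bayes risk r(pi, rule) of a non-randomized rule given as a dict x -> action."""
    return sum(loss(t, rule[x]) * f[t][x] * prior[t] for t in thetas for x in xs)

a = {x: bayes_action(x) for x in xs}
all_rules = [dict(zip(xs, choice)) for choice in product(actions, repeat=len(xs))]
best = min(bayes_risk(rule) for rule in all_rules)
```

In this example the pointwise minimization gives `a = {0: 0, 1: 1}` with Bayes risk $1/5$, which agrees with `best`, the minimum over all four non-randomized rules.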

If A is finite, then any function on A attains its minimum, so Bayes rules always exist. 
They may not when A is infinite, as was mentioned in the last section for decision problems 
without a sample space: 

1.3.12 Example. For $A$ infinite, a Bayes rule need not exist even if $X$ is a singleton, say
$X = \{0\}$, so that an observation makes no difference and $\{P_\theta,\ \theta \in \Theta\}$ reduces to the single
law $\{\delta_0\}$. For example let $A$ be the set of positive integers and let $L(m) := L(\delta_0,m) :=
1/m$ for $m = 1, 2, \ldots$. Then the infimum of risks is $0$ but it is not attained by any decision
rule.

The following will not be hard to prove: 

1.3.13 Proposition. For any dominated measurable family $\{P_\theta,\ \theta \in \Theta\}$ of laws on a
sample space $(X,\mathcal{B})$ and prior $\pi$ on $\Theta$, the posterior distribution $\pi_x$ given $x$ is a conditional
distribution of $\theta$ for $\Pr$ (defined in Proposition 1.3.5) given $x$.

Proof. From the proof of Proposition 1.3.5, $\Pr$ has a density $f(\theta,x)$ with respect to
$\pi \times v$. Thus for any $C \in \mathcal{T}$, $\pi(C) = \int_X \int_C f(\theta,x)\,d\pi(\theta)\,dv(x)$. Multiplying and dividing
by $\int f(\psi,x)\,d\pi(\psi)$, which by Theorem 1.3.7 is strictly positive and finite for $\gamma$-almost all
$x$, we get

$$\pi(C) = \int_X \pi_x(C) \int_\Theta f(\psi,x)\,d\pi(\psi)\,dv(x) = \int_X \pi_x(C)\,d\gamma(x)$$

since $\gamma$ is the $X$ marginal of $\Pr$. It follows that a conditional distribution of $\theta$ given $x$ for
$\Pr$ is the posterior distribution $\pi_x$. □

PROBLEMS 

1. If $\mathcal{P} = \{P, Q\}$ as in the Neyman-Pearson situation, and $\pi$ is a prior with $\pi(P) = p =
1 - q$, find the posterior probabilities given $x$ in terms of $p$, $q$ and $R_{Q/P}(x)$.

2. Suppose the action space $A$ is countable and there is a Bayes randomized decision rule
$d$ such that for each $x$, we have $d(x)(a) > 0$ for every $a \in A$. Then, show that every
randomized decision rule is Bayes.

3. Let $X$ be the Cartesian product of $n$ copies of $\{0, 1\}$ (the vertices of the unit $n$-cube)
and let $P_\theta$ be the product of $n$ copies of the law with probability $\theta$ at 1 and $1 - \theta$ at 0 for
$\theta \in \Theta = [0, 1]$. In other words, suppose we have $n$ independent trials with probability $\theta$
of success. Suppose that the prior for $\theta$ is the uniform distribution on $0 \le \theta \le 1$. If the
observations (in other words, the coordinates of "the" observation) consist of $k$ 1's and
$n - k$ 0's, find the posterior distribution of $\theta$.